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Abstract 

An algorithm for fast calculation of the Coulombic forces and energies 
of point particles with free boundary conditions is proposed. Its calcu- 
lation time scales as NlogN for N particles. This novel method has 
lower crossover point with the full 0{N'^) direct summation than the Fast 
Multipole Method. The forces obtained by our algorithm are analytical 
derivatives of the energy which guarantees energy conservation during a 
molecular dynamics simulation. Our algorithm is very simple. An MPI 
parallelised version of the code can be downloaded under the GNU Gen- 
eral Public License from the website of our group. 

1 Introduction 

The computation of the electrostatic or gravitational interaction of a large num- 
ber A'^ of point particles is a central problem in many fields of physics, such as 
molecular dynamics [1] and astrophysics [2]. 

If the charge distribution is nonperiodic, one frequently employs the fast 
multipole method (FMM) [3]- [8]. Its computation time scales as 0{N). Its 
point of the crossover with full direct calculation depends on the accuracy and 
is estimated to be 10^ - 10'' in [5] or 10'' - 10^ in [9]. The Barnes and Hut 
hierarchical tree method [10] is advocated for the use with low accuracy (0.1 — 
1%) in [11]. 

The cutoff methods were recently revived [12], [13]. They also scale as 0{N), 
but because of the immense pref actor they are only used for special systems. 

If the charge distribution is periodic, one can use the Ewald method [14]- 
[16], that scales as 0{N^/'^). It is good for up to 10'^ — 10^ particles and can 
achieve high accuracy; it is clearly faster than full direct summation of the 
periodic particle images. 

The Ewald method was further improved by the use of Fast Fourier Trans- 
form; that led to the particle-mesh schemes [17]- [28]. These algorithms scale as 
O(A^logA^). The same scaling was obtained by combining the Ewald method 
with the FMM for the calculation of the short range forces [29] and, alterna- 
tively, by using nonuniform FFT in the Ewald method [30]. The particle-mesh 
schemes are suitable [11] for relatively high RMS force error of about lO^'' and 
A^ = lO'' ~ 10^ 

The FMM method can be applied in the periodic case too [6], [22], [31] - 
[33] especially if one needs high precision. Due to its better scaling, it becomes 
preferable in the limit of large A^. However for not so big A^ the particle-mesh 
schemes are faster. The exact location of the crossover point depends both on 
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the system studied and the computer used [11], [33]. The usual estimate for the 
crossover is around N = 10^ — 10^. 

It is mentioned in [1], [34] that in the FMM the forces are not equal to the 
negative analytical gradients of the energy. Therefore if one uses the FMM for 
an MD simulation, the total energy will change during the simulation, unless 
very high precision is used - see e.g. Fig.l of [32]. Thus it is impossible to do 
simulations in the microcanonical ensemble. 

In the particle-mesh schemes it is possible to have conservation of either 
energy or momentum, but not the two together [20], [21]. The particle-mesh 
schemes are usually also easier to code, compared to the FMM. 

Much attention has been given to the parallelization of the particle-mesh 
[22], [25], [35]- [37], and especially FMM [38]- [41] algorithms. In the case of 
FMM, the parallelization is more difficult technically, but more efficient. Recent 
papers [42], [43] cite speedups of over 100 for 10^ particles for a state of the art 
parallel implementation of FMM. In contrast, the parallel speedups of the simple 
particle-mesh methods of [35], [37] do not exceed 10 for 10^ particles; there is 
probably room for improvement. 

Many systems ranging from those in the electronic structure calculations of 
molecules to those of astrophysics require the use of free boundary conditions 
(BC). Therefore the imposition of artificial periodicity that is needed for the 
Ewald-type methods can lead to complications [44], [45]. It would be useful to 
have a particle-mesh-like method that could handle free BC. 

Such algorithms were set forth in [46], [47], [48]. The Hockney method [17] 
was intended for applications in plasma physics where high precision is not 
required. Martyna and Tuckerman [47] used the SPME method of Essman 
et al. [20], along with a special scheme to handle free BC. That scheme was 
also used for the solution of Poisson equation for smooth charge distributions 
occurring in electronic structure calculations. However, the method of [47] has 
the disadvantage that the accuracy decreases if one approaches the border of 
the cell. Thus one needs to take bigger cells that leads to an increase of the 
CPU time needed. 

Another line of the particle-mesh like algorithms stems from the Fast Fourier 
Poisson method [19], [25]. In that case one solves the Poisson equation for the 
auxiliary charge distribution that is made up from Gaussians centered at the 
original particle positions. The resulting charge density is then projected onto a 
grid and the thus discretized Poisson equation is solved via an FFT. The forces 
are obtained by analytical differentiation of the energy expression; this solves 
the aforementioned problem of energy conservation. This method is believed to 
be accurate but slow [1] compared to the standard particle-mesh schemes such 
as P3M [17] or SPME [20]. 

It was proposed in [49] to use multigrid [50] for the solution of the Poisson 
equation, instead of the Fast Fourier Transform. This technique was further 
developed in [25] , [27] , [28] . The CPU time in the multigrid algorithm scales as 
0{N); however due to large prefactor it becomes preferable to the FFT-based 
methods only for a very large N and/or on a massively parallel computer. 

The above Fast Fourier-Poisson methods are only applicable with the pe- 
riodic BC. However, in the new method of Sutmann and Steffen [48] the free 
BC were accounted for by calculating the potential at the boundaries with the 
FMM algorithm. This algorithm may be a competitor to the FMM algorithm 
in the future. Another particlc-mesh-like algorithm that uses an iterative solver 
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instead of FFT is described in [26] . 

In a recent paper [51] we proposed a new algorithm for the solution of the 
Poisson equation for smooth charge distributions with free BC. It is also MPI- 
parallelised. 

In the present paper we propose a combination of the Poisson solver with 
free BC [51] with the Fast Fourier Poisson method [19]. We call it particle- 
particle particle-scaling function (P3S) algorithm to emphasize the relation to 
the particle-mesh schemes. Our method can achieve high accuracy and speed 
and compete with the FMM schemes in the range of particle numbers N = 10'^ — 
10^ that is important in many applications. The approximate forces resulting 
from our method are exact (negative) analytic gradients of the approximate 
energy, which allows the conservation of energy during MD runs. 

To calculate the short-range forces, we make a linked list [15]. Then, follow- 
ing [52], [53], we rearrange the particles to optimize the cache performance. 

An MPI parallelised version of our code can be downloaded under the GNU 
General Public License from the website of our group, 
http:/ /pages. unibas.ch/comphys/comphys/SOFTWARE/index. html. 

The paper is organized as follows. Section 2 describes the Ewald construc- 
tion in free BC. In Section 3 we briefly review the Poisson solver of [51] for 
the calculation of the long range Ewald forces and energies for free BC . In 
Sections 4 and 5 we discuss the cutoffs for the short and long range forces, re- 
spectively. In Section 6 we describe the linked cell list for the acceleration of 
the short range force calculation. Section 7 contains the final formulas for the 
interparticle forces. In Section 8 we give the results of our code for systems 
of 1000-20000 particles and give the graphs of the optimal parameter values 
for a given accuracy. Section 9 describes the results of an MD simulation of a 
1000-particle NaCl crystal that demonstrates the energy conservation property 
of our algorithm. Finally, in Section 10 we describe an efficient parallclization 
of our algorithm and present the parallel speedup results on a CRAY XT3. 



2 The Ewald construction in free boundary con- 
ditions [22]. 

The total electrostatic energy of N point charges in free BC is given by 

Adding and subtracting the term corresponding to the electrostatic energy 
of smooth point charges with density Pi(r), we get: 



I r-r' 2^1 r - r' 
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The Ewald choice for the screening charge distribution is 

p,(r)=7(r-r,), ^{v) ^ Q.,{G^ / nf'^ e^-G^A- (1) 

We have also experimented with the variant 7(r) ^ ^(?'cut ^ r^)™; to = 4, 8, 16, 
where the factor A normaUzes the charge to one. However, the CPU time spent 
by the program with this screening distribution was shghtly longer than with 
the Gaussian for the same accuracy. 

The sum of the screening charge distributions is 

TV 

p(r)=^p,(r). (2) 

i=l 

Then, (1) assumes the form 

U = -Eshort + £^long — ^^solf , (3) 

N N 



short 



■' — 1 j^i \ V / 



i?io„g = i/4^drdr', i?,,,.^Vg,?. (5) 



2j |r-r'| ' V2^^ 



3 Calculation of the long range energy using the 
interpolating scaling functions. 

The long range term i^iong in (5) is nothing but the electrostatic energy of the 
smooth charge distribution p(r). Therefore it can be evaluated by a suitable 
Poisson solver. We make use of the one from Ref. [51] with free BC. It amounts 
to expanding the screening charge density (2) in a real space basis defined on 
the grid with spacing h: 

p{v) « p(r) = ^pi$i(r), iEEE (ii,z2,i3), (6) 

i 

$i(r) = <^{xlh-ii)^y/h-i2)^z/h-i^). 

It was suggested in [51] to take the interpolating scaling functions [54], [55] of 
high order (up to 100) as the basis functions (j){x). The scaling functions of the 
order N interpolate the polynomials of the order N exactly and are reasonably 
localized. Therefore they can interpolate a Gaussian very well. 

On the other hand, since the scaling functions are cardinal, we obtain for 
the coefficient in (6): 

Pi = Pi'^h). (7) 

The action of calculating this screening distribution on the grid is called the 
charge assignment in the standard P3M schemes. 

Consider the potential that arises from the approximate charge distribution 
p{r) in (6): 

^(r) = / r^dr'. (8) 
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At a grid point j, this potential has the form 



i 

where the kernel 

is computed in [51]. 

From this moment on we can use the grid sum approximation to the long 
range energy: 

^3 ^5 

i ij 

The latter sum is a convolution that can be calculated via FFT techniques [51]. 
The kernel Ki is calculated only once at the beginning of a calculation and 
does not change. Thus the use of high order of interpolation does not lead to a 
significant increase of calculation time [51]. 



4 The long-range part cutoff. 

The density array in (9) is defined via (7): 

N 

Pi = p{ih) = ^7(i/i~ rj). 

However, the Gaussian (1) is a quickly decaying function. Therefore, one can 
make the summation in the above equation only for the charges within the 
distance Xcut from the grid point: 

Pi« 7(i/i-r,). (10) 

ih-rj|<a:c-ut 

The cutoff distance is chosen so that the function (1) has a small value at the 
radius Xcut- 

Actually, the sum in (10) is calculated slightly differently in our program. 
The charge position Vj is replaced by the position hij of the grid point that is 
nearest to it: 

Pi« ^ 7(Mi-i.))- (11) 

The error entailed by this is negligible. Then we precalculate and store the 
square roots of integers in order to determine to which gridpoints a given charge 
contributes. This is faster than calculating square roots of real numbers and 
truncating to integers in (10). 

As a result the calculation time of the charge spreading and long-range 
force interpolation is reduced allmost by a factor of two compared to a simple 
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summation over a rectangular domain. Note that this is only possible because in 
the FFP algorithm the charge density assigned to the grid is made of Gaussians. 
In other particle-mesh schemes B-splines are used instead, so our method would 
be inapplicable there. 

Making the cutoff approximation in (9) gives us, finally; 



The long-range energy of our algorithm is given only by the formulas (11), (12). 
Our algorithm is thus as simple in formulation as the Fast Fourier Poisson 
method [19]. On the other hand, the properties of interpolating scaling func- 
tions allow us to combine fast Gaussian charge assignment and very high order 
interpolation. This is to be contrasted e.g. to the use of Lagrange interpolation 
in the PME method [18] where the values of the interpolating functions have to 
be actually computed at the particle positions. A good account of this method 
is given e.g. in Ref. [57]. If one uses Lagrange interpolation of the order L, 
then the length of the Lagrange interpolation filter is i + 1 and the number of 
grid points to which a single charge is assigned is thus (L -I- 1)^. For large L 
this means that a lot of CPU time will be spent on the charge assignment and 
force back interpolation. To avoid this, in the PME one uses only low order 
interpolation. 

However, we would like to repeat that our method treats systems with free 
BC whereas the standard particle-mesh ones are primarily for the periodic sys- 
tems. 

To illustrate the accuracy of our approximation, we calculate the self-energy 
of a single unit Gaussian (1) by the formula (12). The Gaussian is initially 
at a grid point and then gradually shifted by two grid constants to see the 
dependence of the error on its position relative to the grid. We have Xcut = 4.5/i 
and h = 0.7; these are typical parameter values that we use in other tests . 

The graph of the resulting energy error vs. the shift distance is given at Fig. 
1. We see that the error is rather smooth; there is no visible jumps coming from 
the cutoff approximation. The energy oscillation is due to the grid discretization; 
its amplitude goes to zero in the limit h ^ 

5 The short-range part cutoff. 

The short-range pairwise potential in (4) is a rapidly decaying function. There- 
fore, it is standard to introduce the cutoff radius rcut beyond which the short- 
range potential is approximated by zero: 



To summarize, the Coulombic energy of N charges in our approximation has the 
form (3) with the approximate long and short contributions given by (12), (13), 
correspondingly. 

The evaluation of the error fimction in (13) is computationally inefficient, 
compared to algebraic operations. Therefore, we replace the term that corre- 
sponds to error function by a spline approximation in the interval (0,rcut)- 




(12) 




(13) 
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Figure 1: Relative error in the self-interaction energy of a single Gaussian 
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The function erfc {Gr /\/2)/r has a singularity at 0. Therefore we rewrite it 



(14) 



erfc(Gr/V2)/r = l/r - erf (Gr/V2)/r 



and approximate with a spline only the second term which is regular. This ap- 
proximation holds for r < Tcut- For r > r^nt, the whole potential erfc [Gr / ^/2) / r 
is set to 0. 



6 Calculating the forces. 

The force acting on the i-th particle is given by the gradient of (3): 



short 



Tishort 



dU_ 
dvi 



-^erfc 

'''ij 



Gr,, 



V2 



(15) 



where = — ; the prime at the sum indicates that it is performed over j i. 
The short range forces (15) are equal to the negative analytical gradients of the 
short range energy (4). They are also calculated via a spline approximation. 
Actually, the spline that we use for (14) is obtained by integration of the spline 
for (15). 

Following [25], we also get the long range forces as the negative analytical 
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gradients of (12): 



Ij * Ij 

where 

q(r) = ^ = -2rG^(r) 
or 

is the vector of derivative Gaussians. The calculation of the long range forces 
(16) is called the force back interpolation. 

Taking into account the cutoff (11), one can rewrite the long range force as 

F'r^h' E q(l/i-r.)Pjifi-j. 

7 Linked cell list. 

In order to avoid the scanning of all particle pairs in (13) we make a linked 
subcell list [15]. The subcell size is smaller or equal to rcut/MMAX where 
MMAX is a positive integer ranging, in practice, from 1 to 3. The standard 
linked cell list algorithm corresponds to MMAX = 1; higher values of MMAX 
correspond to smaller cells such as those of [52], [53]. To account for the free 
BC we add MMAX layers of empty cells outside the original cell. The sum of 
forces for a given particle is then done over all particles in subcells which are 
within Tcut of the subcell where the original particle is located. 

The sum over subcells is done along the stripes in the direction x. In par- 
ticular, following [52], [53], we rearrange the particles so that the particles in 
the same subcell have consecutive numbers and the subcells are sorted in the x 
direction. The actual sorting can be avoided because we already have the linked 
subcell list. In more detail, 

• The first and last particle numbers in each subcell are stored in separate 
arrays HEAD and TAIL. 

• If the cell is empty, its HEAD value is that of the neighboring cell in the 
direction of increasing x coordinate. For the cells outside the simulation 
box (to the right of it), HEAD = iV + 1 by defauh. 

• The TAIL value of an empty cell is that of its neighbor in the direction 
of decreasing x. For the outside cells (to the left of the simulation box), 
TAIL = 0. 

• The loop over all particles in the cell stripe: 

(*1,«2,«3), («1 + 1,«2,«3), • ■ • , (.7l,«2,«3) 

is done as a loop the over particle numbers from HEAD(ii, 12, ia) to 
TAIL(ji, 12, Ja). In this way it does not matter whether empty cells occur. 
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The reordering of the particles has the additional advantage that the cache 
performance of the long range part is optimized. The reason is that the particles 
that have close indices are also close physically: jr^ — r^l is small when |j — j] is 
small. 

The loop in the charge assignment and force back interpolation goes over the 
particle numbers. The particles in the same subccU have overlapping screening 
charge densities, so the elements of the density/potential array are reused again 
and again, as the loop index goes over the particles in the same subcell. 

This makes the program significantly faster compared to the case without 
reordering, especially if the particles are distributed randomly. 

8 The results of the serial code. 

We chose two test systems: 

• N particles in the unit cube with random coordinates and charges equal 
to ±1, the total charge being zero (henceforth called the random system) 

• N particles with charges ±1 forming a rock salt crystal, filling the unit 
cube with M « N^^^ nodes at each side. The lattice constant is then 
d = {M — 1)^^. Each particle is then shifted away from its node by a 
vector with random coordinates in the range (— c?/3, c?/3). This system 
will be called the crystal one. 

In both cases, the particle number = 1000 * lO-'^'^; j = . . . 4. In other words, 
we consider the following values of N: 1000, 2154, 4642, 10000, 21544. 

We always use the scaling functions of the order 100. We found that using 
lower orders leads to a decline in accuracy, while the order of the scaling func- 
tions only affects the time of the calculation of the kernel which is done only 
once for a simulation. 

The values of Xcuti ^cut- G and h are free parameters. As explained e.g. 
in [57], such parameters should be chosen such that 

• The accuracy is fixed to some value. 

• The CPU time spent is minimal for the given accuracy. 

As the measure of the accuracy we use the mean square force error: 



where 

are the forces obtained by the full direct calculation. 

Unfortunately we cannot give the a priori error estimates of the accuracy. 
The error in the short-range forces is the same as that for the Ewald method, 
and is estimated in [56], [57]. However, our calculation of the long range forces 
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involves the use of a finite basis and the approximation in Eq. (9) , the accuracy 
of which is difficult to estimate. 

Therefore, for our two test systems we made tests with all reasonable val- 
ues of the parameters. The most important parameter is G, the width of the 
Gaussian in (1). We also introduced the following dimensionless quantities: the 
dimcnsionlcss Gaussian cutoff Gxcnt] the cutoff ratio rcut/a^cut, the dimension- 
less grid constant Gh and 

_ 4/3^rg„t 

-'Vshort = 77 JV. 

V'box 

The latter is the average number of particles inside a sphere of radius rcut ■ It 
turns out that the relative MSQ force error depends stronger on these dimen- 
sionless quantities and G than on the number of particles in the system. 

We checked 7-10 possible values for each parameter. From the resulting 
pool of results we selected the so called Pareto frontier. A point is on the 
Parcto frontier if there is no other point which has both smaller CPU time and 
smaller error. 

The Pareto frontiers for the values of = 1000 x 10^^^; j = ... 4 for the 
random system and the crystal systems arc given at the Fig. 2 and Fig. 3. The 
tests where done on an 1.8GHz AMD Opteron 244. 
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Figure 2: The Pareto frontiers for the random system on Opteron 244. 



For the purpose of illustration we present here the optimal parameter values 
for the crystal system. The values for the random one do not differ significantly. 

From Figs. 4,5 we see that the optimal values of Gxcut and Gh are deter- 
mined by the accuracy level. In contrast, the optimal value of A^shoit depends 
on the number of particles too. Therefore in an actual calculation it must be 
adjusted by the trial and error method to give optimal accuracy and CPU time. 



10 



10 



0.1 



0.01 



-TT 1 1 1-1 r- 



-1 r-| 1 1 i-p 

N=1000 — 
N=2154 -x- 
N=4642 

N=10000 B 

N=21544 



03 -B 



■El3- 



-EtJEIl- 



-□EE-. 




J-J I I l-J L 



1e-09 1e-08 1e-07 1e-06 1e-05 1e-04 
Relative MSQ force error 

Figure 3: The Pareto frontiers for the crystal system. 
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Figure 4: The optimal values of Gxcut, crystal system. 
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Figure 5: The optimal values of Gh, crystal system. 
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Figure 6: The optimal values of iVshort; crystal system. 
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Figure 7: Values of Tcut/^^cut for both random and crystal systems. 

This is similar to finding the optimal value of the Gaussian width in the standard 
particle- mesh schemes [24] . Finally, the optimal ratio rcut /a^cut was observed to 
be independent on the required accuracy but slightly decreased with rising N, 
as seen on Fig. 7. 

The Pareto frontiers allow us to determine the crossover points for each N: 
the values of MSQ force error for which our calculation takes the same time as 
the full direct one. They are presented at Fig. 8. We have plotted the crossover 
curves for the random system and for the crystal system. For comparison we 
have also plotted the crossover curve of the fast multipole method taken from [5] , 
where one of the best implementations of FMM is described. In that paper the 
same charge distribution was used as in our random system. Of course, the 
random positions of the particles where different in our tests and in those of [5] . 
However, we still see that our method has lower crossovers than the FMM for 
the same accuracy. 

9 The energy conservation. 

In contrast to other methods such as FMM [1], [34], our algorithm has the 
advantage that the approximate forces are exact analytic derivatives of the 
approximate energy. This allows for energy conservation during an MD run. 
To illustrate this, we make an MD simulation of a rock salt crystal formed by 
1000 Na and CI atoms. The particle positions and velocities are updated by the 
velocity Verlct algorithm. 

To get physically reasonable results, we made the particles interact through 
the Born-Mayer-Huggins-Fumi-Tosi (BMHFT) rigid-ion potential [58] that has 
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bonding terms in addition to the Coulombic force. 

At first we made the system equilibrate for 300 oscihation periods. We 
then monitored the potential and total energy for another 100 periods using the 
full direct algorithm. Then the last 100 periods were repeated using our P3S 
algorithm. 

On Fig. 9, 10 we plot the absolute values of deviations of the potential and 
total energy from their mean values. The ratio of the mean square deviation of 
the total energy to that of the potential one is found to be equal to 1.4 x 10~^. 

The P3S results are shown on the graphs; those of the full direct calculation 
are indistinguishable from the P3S ones. 



10 Parallelization. 

The parallelization of the calculation of the short-range energy and forces is 
straightforward: for the particle i and processor IPROC, we calculate the force 
only if MODULO(I,NPROC) = IPROC, where NPROC is the total number of 
processors. In the end, an MPLALLREDUCE command sums all the contribu- 
tions. 

For the long-range part, we rely on the parallel structure of the Poisson 
solver [51]. The charge density on the grid is divided into slabs in the x — y 
plane. Each such slab is the input density at a separate node. The Poisson 
equation (for the whole cell) is solved for that slab density. The output at a 
given node again contains only the corresponding slab of the grid, this time 
it is a piece of the potential array. Of course one needs global interprocessor 
communications, including MPLALLTOALL for the solution of the Poisson 
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Figure 10: The total energy fluctuations with the P3S method. 
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equation. 

Then in the original algorithm of [51], an MPLALLREDUCE command 
sums up the potentials of all the slabs. 

In our program for point particles, we kept the parallelization of the charge 
assignment as above: each processor receives only the grid charges from the 
corresponding slab. However, we do not use the MPI_ALLREDUCE command 
to get the potential. Instead, we used the slabwise structure of the output of 
the Poisson solver. For each node, the corresponding slab potential contributes 
only to the forces on particles that are close to it. 

In the end, we add up the forces with the MPI_ALLREDUCE command. 
This MPLALLREDUCE is actually merged in the program with the one for the 
short-range forces. 

In this way we minimize the interprocessor communication considerably com- 
pared e.g. to [22] since it is much easier to send the components of the forces 
than the pieces of the enormous potential array. 

To test the parallel program we ran it for the random system (the same as 
in the serial tests) with 10'"' particles. The relative accuracy of the forces was 
kept around 10~^. 

The result is given at Fig. 11. At the y axis we have the ratio of the CPU 
time spent on several processors to that spent by one processor. 
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Figure 11: The parallel speedup results on a CRAY XT3 for 10^ particles. 



11 Conclusion. 

We have developed a point particle Poisson solver algorithm that has a lower 
crossover point than the FMM. It can be considered as a generalization of the 
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particle- mesh solvers for free boundary conditions. It can also achieve high 
precision. 

The forces obtained by our program are analytical derivatives of the energy; 
this is an advantage in the context of MD simulations. An MPI parallelised 
version of the algorithm is presented that scales well on a moderate number of 
processors. 
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